Whole-genome sequencing offers a single comprehensive test for detecting clinically relevant genetic aberrations in patients with hematological malignancies. In diagnostic settings, however, it is challenging to identify and interpret clinically relevant somatic single nucleotide variants (SNV), small insertions and deletions (INDEL), structural variants (SV) and copy-number variants (CNV) in a manner that retains the sensitivity of the analysis while still enabling rapid interpretation. SNV and INDEL are smaller than 10 bp and are identified using information contained within the sequenced reads. Both SV and CNV are typically longer than 100 bp and are identified using different properties of the aligned reads, including split reads and the orientation of and distance between read pairs. These aberrations are further divided into categories depending on their complexity and the method of identification: duplications (DUP), deletions (DEL), inversions (INV) and translocation breakpoints (BND). Although CNV is defined as an additional type of variant, the copy-number gains and losses identified from the read-depth profile can be categorized as DUP and DEL, respectively. All four types of SV arise from complex rearrangements in the genome. Their inherent complexity, combined with the methods used to identify them, makes it difficult to obtain a comprehensive set of true biological events. Furthermore, identifying clinically relevant somatic variants requires classifying all variants as germline or somatic, using additional processing and filtering that depend on the variant type. Finally, to meet clinical diagnostic demands on turnaround time, the entire process needs to be automated with minimal human intervention, from sample preparation and bioinformatic analysis all the way through clinical interpretation.
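The size classes and the mapping of copy-number changes onto SV categories described above can be illustrated with a minimal Python sketch. The function names, the diploid baseline of 2 and the treatment of the 10 bp and 100 bp figures as hard cut-offs are assumptions made for illustration only; this is not code taken from BALSAMIC.

```python
from typing import Optional

# Size boundaries quoted in the text. Treating them as hard cut-offs is an
# assumption made only for this illustration.
SMALL_VARIANT_MAX_BP = 10
STRUCTURAL_VARIANT_MIN_BP = 100

# The four SV categories named in the text.
SV_TYPES = ("DUP", "DEL", "INV", "BND")


def size_class(length_bp: int) -> str:
    """Assign a variant to a broad class based on its length in base pairs."""
    if length_bp < SMALL_VARIANT_MAX_BP:
        return "SNV/INDEL"
    if length_bp > STRUCTURAL_VARIANT_MIN_BP:
        return "SV/CNV"
    # The text does not say how lengths between the two bounds are handled.
    return "unspecified"


def cnv_to_sv_type(copy_number: int, baseline: int = 2) -> Optional[str]:
    """Map a copy-number state to DUP (gain) or DEL (loss).

    `baseline` is a hypothetical expected copy number (2 for a diploid
    autosomal region); returns None when there is no change.
    """
    if copy_number > baseline:
        return "DUP"
    if copy_number < baseline:
        return "DEL"
    return None
```

For example, `cnv_to_sv_type(3)` yields `"DUP"` and `cnv_to_sv_type(1)` yields `"DEL"`, which is how a read-depth gain or loss can be placed alongside DUP and DEL calls from other SV callers.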

To tackle these challenges, we implemented an ensemble variant calling approach in BALSAMIC. This involves combining computational methods and tools to identify SNV, INDEL and all four types of SV. These variants are then subjected to further bioinformatic processing and rigorous filtering to prepare them for clinical interpretation in our custom-developed variant interpretation software, Scout. The entire workflow is automated using our in-house automation software, cg. BALSAMIC takes FASTQ files as input and first performs quality control and trimming using FastQC and fastp, respectively. Trimmed reads are mapped to the human reference genome using Sentieon-tools. Duplicate reads are marked and quality controlled using Picard-tools, and the results are summarized by MultiQC. SNV and INDEL are called using Sentieon TNscope and TNhaplotyper. The variants are further processed using filters based on depth, quality and allele frequency. For tumor-only samples, the filtered SNV and INDEL from TNscope and TNhaplotyper are merged using bcftools. Manta, Delly-SV and TIDDIT are used to call SV, while ascatNgs (tumor-normal) and Delly-CNV are used to call CNV. CNV calls are then converted to DEL and DUP so that merged sets can be curated for each SV type across the different SV callers. The SV calls from Manta, Delly, TIDDIT and ascatNgs (tumor-normal) are then merged using SVDB. Prior to annotation, SVs that have a low quality score in the tumor sample, a high allelic frequency in the normal sample (possible germline), or a length greater than 10 Mb are removed. Next, SVs are annotated using Ensembl-VEP and the variants in the intragenic regions are removed. Finally, SVs are uploaded to Scout and those overlapping with clinically relevant genes (defined using an in silico gene list) constitute the final set of clinically relevant somatic SNV, INDEL, SV and CNV.
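The three pre-annotation SV filters described above can be sketched in Python as follows. Only the 10 Mb length limit comes from the text; the quality and allele-frequency thresholds, the `SVCall` record and the function name are hypothetical placeholders chosen for illustration and do not reflect the values or code used in BALSAMIC.

```python
from dataclasses import dataclass
from typing import Optional

# Length limit stated in the text (10 Mb).
MAX_SV_LENGTH_BP = 10_000_000

# Hypothetical thresholds: the text only says "low quality" and
# "high allelic frequency", so these numbers are placeholders.
MIN_TUMOR_QUALITY = 20.0
MAX_NORMAL_AF = 0.05


@dataclass
class SVCall:
    """Hypothetical, simplified record for one merged SV call."""
    sv_type: str                # "DUP", "DEL", "INV" or "BND"
    length_bp: Optional[int]    # None when no length applies (e.g. BND)
    tumor_quality: float        # quality score in the tumor sample
    normal_af: Optional[float]  # allele frequency in the normal; None if tumor-only


def passes_sv_filters(call: SVCall) -> bool:
    """Return True if the call survives all three pre-annotation filters."""
    # 1. Remove calls with a low quality score in the tumor sample.
    if call.tumor_quality < MIN_TUMOR_QUALITY:
        return False
    # 2. Remove calls with a high allele frequency in the normal (possible germline).
    if call.normal_af is not None and call.normal_af > MAX_NORMAL_AF:
        return False
    # 3. Remove calls longer than 10 Mb.
    if call.length_bp is not None and call.length_bp > MAX_SV_LENGTH_BP:
        return False
    return True
```

Applied to a list of merged calls, `[c for c in calls if passes_sv_filters(c)]` would give the set passed on to annotation. Skipping the normal-sample check when `normal_af` is `None` is one possible way to handle tumor-only cases, not necessarily the one used in the workflow.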

To date, this workflow has analyzed more than two hundred cases of different tumor types, mainly hematological malignancies. The approach identifies thousands of variants across all types, but with rigorous processing and filtering, clinically relevant somatic variants are identified in a manner compatible with clinical diagnostic demands.

Software availability

The bioinformatics software BALSAMIC, cg and Scout are available from https://github.com/Clinical-Genomics, and the details of the software and methods used in BALSAMIC are available from https://balsamic.readthedocs.io.

Elhami: Clinical Genomics: Ended employment in the past 24 months. Wirta: Roche: Honoraria; Illumina: Honoraria.

